Data is the fuel that keeps your models relevant and accurate. However, not all of it is useful. This is where Feature Engineering comes in: it ensures that the features you include in your model are impactful and statistically significant. This tutorial will cover the following learning objectives:
What are Features?
What is Feature Engineering?
Feature Engineering Best Practices
What are Features?
Summary
In a Machine Learning context, Features are the variables (x) that you're using to predict the target variable (y).
Example: Suppose you're trying to predict the price of a home. You collect data from thousands of listings with the following features:
price
home_size
num_baths
num_beds
yard_size
location
Price would be your target variable (since that's what you're trying to predict), and Home Size, # of Bathrooms, # of Bedrooms, Yard Size, and Location are your features.
NOTE: Not all variables present in a dataset can, or should, be used as features. For instance, an ID field, such as an employee or student ID, should be excluded from the model since it's an arbitrarily assigned value with no predictive relationship to the target.
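As a minimal sketch of this split, the snippet below separates the features from the target using hypothetical listing data. The column names mirror the example above, plus an assumed listing_id column that gets dropped; location is omitted here since encoding categorical values is covered later.

```python
import pandas as pd

# Hypothetical listings data; column names mirror the example above,
# plus an ID column that should NOT be used as a feature.
df = pd.DataFrame({
    "listing_id": [101, 102, 103],
    "home_size":  [1500, 2200, 1800],
    "num_baths":  [2, 3, 2],
    "num_beds":   [3, 4, 3],
    "yard_size":  [400, 600, 0],
    "price":      [250000, 340000, 275000],
})

# y is the target variable; X holds the features.
y = df["price"]
X = df.drop(columns=["price", "listing_id"])  # drop the target and the ID column
print(X.columns.tolist())  # ['home_size', 'num_baths', 'num_beds', 'yard_size']
```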
What is Feature Engineering?
Summary
Feature Engineering is the process of cleaning and filtering data AFTER performing EDA to enhance your model's performance. Common tasks in this process include excluding outliers, excluding features that are not statistically significant, and imputing missing values.
It's easy to get confused about the difference between Feature Engineering and Data Wrangling. Just remember that the purpose of Data Wrangling is to clean and prepare your data for EDA, whereas Feature Engineering applies the insights gained from EDA to prepare and enhance your dataset for Model Training.
Obvious Feature Engineering involves fixing problems in your training data that can be seen without doing EDA. This can include excluding ID columns, creating calculated fields based on existing features to enhance your model's performance, or adding features from other datasets that are relevant to predicting your target variable.
Non-Obvious Feature Engineering involves fixing problems that you identify in the EDA process. This can include imputing missing values, excluding outliers, or excluding features that aren't correlated with the target variable.
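To make this concrete, here is a minimal sketch that applies one obvious step (a calculated field) and two non-obvious steps (imputation and outlier exclusion) to the hypothetical home-listings data. The file name, column names, and percentile thresholds are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical file and column names, continuing the home-listings example.
df = pd.read_csv("listings.csv")

# Obvious: create a calculated field from existing features.
df["total_rooms"] = df["num_beds"] + df["num_baths"]

# Non-obvious: impute missing yard sizes with the median observed during EDA.
df["yard_size"] = df["yard_size"].fillna(df["yard_size"].median())

# Non-obvious: exclude extreme price outliers flagged during EDA
# (here, anything outside the 1st-99th percentile range).
low, high = df["price"].quantile([0.01, 0.99])
df = df[df["price"].between(low, high)]
```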
Domain Knowledge is an understanding of how a specific entity or process works. This typically refers to understanding how your business collects data, how it creates and uses KPIs to track performance, and general industry knowledge, such as which variables tend to be correlated and what certain acronyms or abbreviations mean. As you gain experience as an ML Engineer, you'll naturally build domain knowledge within the business you work for or the industry you're in.
NOTE: As mentioned in the video above, Feature Engineering can include encoding categorical values for use in Model Training. However, this step is typically associated with Data Preprocessing, the next step in the MLOps Lifecycle.
Feature Engineering Best Practices
Summary
Feature Selection is the process of selecting a subset of features present in your dataset AFTER performing EDA.
Feature Selection is a critical step in the Feature Engineering process and allows you to:
Reduce the dimensionality of your dataset. If your dataset has more than 100 features, there are likely several you can trim to make your dataset easier to interpret and work with.
Improve Training Efficiency. When performing Hyperparameter Tuning (See Next Tutorial), it helps to have a smaller, more impactful subset of features so you can tune your model's performance more efficiently.
Remove Noise and Redundancies. Noise is the signal-free variation that irrelevant features introduce; it distracts the model from optimizing itself on impactful features. If you're using a Gradient-Boosted Model, as discussed in our "Deep Learning" Tutorial Series (Coming Soon), irrelevant features pull the model away from the calculations on impactful features that ultimately lead to better results.
Intrinsic Methods (also called embedded methods) perform feature selection as part of the model training process itself: the algorithm learns which features matter as it fits the data, without a separate selection step. Intrinsic Methods are built into the following types of models (a short sketch follows this list):
Tree-Based Models. A common example is a Random Forest, an ensemble of Decision Trees in which each tree trains on a random subset of the features and samples. How often, and how effectively, a feature is used in the trees' splits provides a built-in measure of its importance while your evaluation metric(s) are optimized.
Regularization Models. Models such as Linear Regression, Support Vector Machines (SVMs), and Neural Networks can use various regularization methods to shrink the weights of features based on their impact on the predictive outcome, in some cases (such as L1/Lasso regularization) driving unhelpful weights all the way to zero. The higher a feature's remaining weight, the higher its impact. These methods are very good at reducing noise.
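The sketch below illustrates both flavors of intrinsic feature selection on synthetic data generated with scikit-learn's make_regression, so the feature counts and model settings are arbitrary assumptions: a Random Forest's built-in importances and a Lasso model's regularized coefficients.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Synthetic regression data: only 3 of the 8 features are actually informative.
X, y = make_regression(n_samples=500, n_features=8, n_informative=3, random_state=42)

# Tree-based model: importances come from how much each feature improves the trees' splits.
forest = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)
print("Random Forest importances:", forest.feature_importances_.round(3))

# Regularized linear model: L1 (Lasso) regularization shrinks the weights of
# unhelpful features toward, or exactly to, zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso coefficients:", lasso.coef_.round(3))
```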
Intrinsic Methods Pros and Cons:
Pros:
Efficient, because feature selection is embedded in the training algorithm itself.
No external tools or methods need to be used, leading to lower overhead costs.
Weights features by their importance during training, which tends to lead to strong model performance.
Cons:
Only a limited number of algorithms have these methods embedded. However, the algorithms that do are commonly used.
Filter Methods use descriptive statistics to score the relationship between the target variable and each feature, then filter out the features that don't measure up. This approach can be applied universally, whereas Intrinsic Methods are embedded in only a limited number of algorithms. Filter Methods can be applied in the following ways (a short sketch follows this list):
Univariate Statistical Analysis. This method is primarily used in the EDA process. With this method, you use inferential statistics to analyze the relationship between the target variable and each feature, one at a time. Then, you set a benchmark score (such as a correlation of at least 0.45) and filter out features until only the most relevant ones remain.
Feature-Importance Based. This method is used AFTER you've fit your model to the training data with all features present. You then examine the impact each feature had on predicting the outcome and use a benchmark score (such as the feature scores provided by SciKit-Learn) to filter out features that aren't impactful beyond a certain threshold. This approach is commonly used when comparing p-values in Linear and Logistic Regression.
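Below is a minimal sketch of both filter approaches on synthetic data; the 0.45 correlation benchmark mirrors the example above, and the k=3 cutoff for SelectKBest is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data with generic feature names.
X, y = make_regression(n_samples=500, n_features=8, n_informative=3, random_state=42)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

# Univariate filter: keep features whose absolute correlation with the
# target clears the 0.45 benchmark.
correlations = df.corrwith(pd.Series(y)).abs()
print("Passed benchmark:", correlations[correlations >= 0.45].index.tolist())

# Score-based filter: keep the k best features according to a univariate F-test.
selector = SelectKBest(score_func=f_regression, k=3).fit(df, y)
print("Top 3 by F-score:", df.columns[selector.get_support()].tolist())
```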
Filter Methods Pros and Cons:
Pros:
Simple and fast to compute.
Can be used regardless of the algorithm being trained.
Cons:
Prone to selection bias, commonly in the form of selecting redundant or duplicate features.
Ignores relationships between features. Focuses only on relationships between features and the target variable.
Wrapper Methods take an iterative approach to feature selection. Rather than being embedded in an algorithm natively, these methods repeatedly take candidate subsets of features, fit the model on each subset, and evaluate the result. The search ends when no better score can be found (the scoring metric varies based on the algorithm being used). Wrapper Methods can be applied in the following ways (a short sketch follows this list):
Sequential Feature Selection (SFS). Tools such as SciKit-Learn have built-in capabilities that allow this process to occur in your ML pipeline. SFS uses a greedy search: rather than trying every possible combination of features, it adds or removes one feature at a time, scores each candidate subset with the algorithm, and keeps the change that produces the best evaluation score. Once the iteration ends, it selects the subset of features that produced the best result and uses those for the evaluation phase. There are two flavors of SFS:
Forward SFS starts with no features, adds the single feature with the highest impact on the cross-validated score, and then keeps adding one feature at a time, always the one that improves the score most, until it reaches the best possible combination of features (or the desired number of features).
Backward SFS starts with ALL features present in the dataset and iteratively removes features one by one, starting with the feature with the least impact on the target variable, until it retrieves the best possible combination of features.
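Below is a minimal sketch of SFS using scikit-learn's SequentialFeatureSelector on synthetic data; the estimator, the number of features to select, and the cross-validation settings are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic regression data: only 3 of the 8 features are informative.
X, y = make_regression(n_samples=500, n_features=8, n_informative=3, random_state=42)

# Forward SFS: start with no features and greedily add the one that most
# improves the cross-validated score; switch direction to "backward" to
# start with all features and remove them one at a time instead.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",
    cv=5,
)
sfs.fit(X, y)
print("Selected feature indices:", sfs.get_support(indices=True))
```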
Wrapper Methods Pros and Cons:
Pros:
Highly effective at finding the optimal combination of features.
Considers combinations of features, rather than evaluating each feature in isolation.
Cons:
Prone to overfitting.
Computationally intensive, especially when a large number of features is present.
NOTE: Domain Knowledge is critical in the Feature Selection process. Since you'll be working with samples of populations, you'll need to use inferential statistics to measure the actual impact a feature will have on the target variable. Domain Knowledge enables you to use a combination of intuition and data-driven insights to select the best features for your model.